This document presents a preliminary analysis of the INC 5000 2019 data.
I used ChatGPT for assistance while processing the data.
The data can be downloaded from:
https://www.kaggle.com/datasets/mysarahmadbhat/inc-5000-companies
To prepare the data we will use the tidyverse package, supported by ggthemes and plotly for the visualisations. To analyse the asymmetry measures of the distributions of the quantitative variables, we will use the moments library.
library(tidyverse)
library(ggthemes)
library(plotly)
library(moments)
We load the data into the inc data frame using the read.csv function:
setwd(dir = 'D:/wd/')
inc <- read.csv(file = 'INC 5000 Companies 2019.csv', header = TRUE, sep = ',',na.strings = c("","NA"))
str(inc)
## 'data.frame': 5012 obs. of 14 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ profile : chr "https://www.inc.com/profile/freestar" "https://www.inc.com/profile/freightwise" "https://www.inc.com/profile/ceces-veggie" "https://www.inc.com/profile/ladyboss" ...
## $ name : chr "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
## $ url : chr "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...
## $ state : chr "AZ" "TN" "TX" "NM" ...
## $ revenue : chr "36.9 Million" "33.6 Million" "24.9 Million" "32.4 Million" ...
## $ growth_. : num 36680 30548 23880 21850 18166 ...
## $ industry : chr "Advertising & Marketing" "Logistics & Transportation" "Food & Beverage" "Consumer Products & Services" ...
## $ workers : int 40 39 190 57 25 742 12 72 60 37 ...
## $ previous_workers: int 5 8 10 2 6 18 1 1 10 5 ...
## $ founded : int 2015 2015 2015 2014 2014 2009 2014 2015 2008 2014 ...
## $ yrs_on_list : int 1 1 1 1 1 1 1 1 1 1 ...
## $ metro : chr "Phoenix" "Nashville" "Austin" NA ...
## $ city : chr "Phoenix" "Brentwood" "Austin" "Albuquerque" ...
A first look at the data shows that:
The data contain a profile field, which will not be analysed further.
The name of the growth_. variable differs from the rest, so the naming should be made consistent.
The revenue variable contains a number together with a unit (Million), so it has to be examined and converted to a numeric format.
In addition, the variables rank, metro, city and state can be converted to the factor type, but only after they have been cleaned.
At this stage we remove the profile column and give the growth_. variable a new name:
inc <- select(.data = inc,-profile)
colnames(inc)[6] <- "growth"
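Renaming by position (column 6) works only as long as the column order stays the same. A minimal base R sketch of renaming by name instead is shown below. The data frame toy is a hypothetical stand-in for inc and is not part of the original analysis.

```r
# Hypothetical two-row stand-in for inc; the column names mimic the original file.
toy <- data.frame(rank = 1:2,
                  revenue = c("36.9 Million", "33.6 Million"),
                  growth_. = c(36680, 30548))

# Rename by matching the old name, so the result does not depend on column order.
names(toy)[names(toy) == "growth_."] <- "growth"

names(toy)
```

If the old name is absent, the assignment changes nothing, which makes the step safe to re-run.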
To analyse duplicated rows we combine the sum and duplicated functions:
cat('The inc table contains', sum(duplicated(inc)), 'duplicated rows')
## The inc table contains 0 duplicated rows
for (i in seq_len(ncol(inc))) {
  # Count the repeated values in the i-th column
  cat(names(inc)[i],
      " - duplicates:",
      sum(duplicated(inc[[i]])),
      "\n \n")
}
## rank - duplicates: 13
##
## name - duplicates: 0
##
## url - duplicates: 0
##
## state - duplicates: 4961
##
## revenue - duplicates: 3997
##
## growth - duplicates: 6
##
## industry - duplicates: 4985
##
## workers - duplicates: 4376
##
## previous_workers - duplicates: 4569
##
## founded - duplicates: 4929
##
## yrs_on_list - duplicates: 4998
##
## metro - duplicates: 4941
##
## city - duplicates: 3454
##
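The per-column duplicate count above can also be written without an explicit loop. The sketch below uses sapply on a small hypothetical data frame (toy, with made-up values) and returns a named vector that is easier to reuse than printed text.

```r
# Hypothetical data; the values are made up for illustration only.
toy <- data.frame(state = c("AZ", "TN", "AZ", "AZ"),
                  city  = c("Phoenix", "Brentwood", "Phoenix", "Austin"),
                  rank  = c(1, 2, 3, 4))

# duplicated() flags every repeat of an earlier value; sum() counts the flags.
dup_counts <- sapply(toy, function(column) sum(duplicated(column)))

dup_counts
```

A high count is expected for categorical columns such as state. It only signals a problem for columns that should be unique, such as name or url.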
sum(complete.cases(inc))
## [1] 4198
colSums(is.na(inc))
## rank name url state
## 0 0 0 0
## revenue growth industry workers
## 0 0 0 1
## previous_workers founded yrs_on_list metro
## 0 0 0 813
## city
## 0
cat('The inc table contains', nrow(inc) - sum(complete.cases(inc)), 'incomplete records')
## The inc table contains 814 incomplete records
The table contains 814 incomplete records. Missing values occur in the metro and workers variables.
rank is the variable that gives the position of a company in the INC ranking.
cat('The rank variable contains', sum(duplicated(inc$rank)), 'duplicates')
## The rank variable contains 13 duplicates
The duplicate analysis shows that the variable contains repeated values, i.e. there are ties in the ranking.
The rank variable is of type int, but it will not be treated as a numeric variable, so it is converted to the factor type.
inc$rank <- as.factor(inc$rank)
name is a variable of type character containing the company name. According to the earlier analyses, the variable has no duplicates and no missing values.
Sample values:
head(inc$name)
## [1] "Freestar" "FreightWise" "Cece's Veggie Co."
## [4] "LadyBoss" "Perpay" "Cano Health"
url is a variable containing the URL of the company website. The variable has no duplicates and no missing values.
Sample values:
head(inc$url)
## [1] "http://freestar.com" "http://freightwisellc.com"
## [3] "http://cecesveggieco.com" "http://ladyboss.com"
## [5] "http://perpay.com" "http://canohealth.com"
The state variable contains the US state in which the company is located, written as a two-letter uppercase code.
This variable will be converted to the factor type for further analyses. The code below also returns the unique values of the state variable:
unique(inc$state)
## [1] "AZ" "TN" "TX" "NM" "PA" "FL" "NJ" "VA" "OH" "CA" "CO" "WA" "NY" "GA" "KY"
## [16] "ID" "UT" "MT" "NC" "WI" "MI" "MD" "MN" "AL" "MA" "ME" "KS" "IL" "CT" "NV"
## [31] "ND" "NH" "MO" "IN" "DC" "NE" "WY" "SC" "LA" "PR" "OR" "IA" "OK" "AR" "DE"
## [46] "SD" "WV" "MS" "VT" "HI" "RI"
inc$state <- as.factor(inc$state)
revenue holds the company's revenue together with its unit (Million or Billion, where Billion follows US usage and means 10^9).
head(inc$revenue)
## [1] "36.9 Million" "33.6 Million" "24.9 Million" "32.4 Million"
## [5] "22.5 Million" "271.8 Million"
The current format of the variable makes quantitative analysis impossible, so it will be split into two helper variables, revenue_value and revenue_unit:
inc <- separate(data = inc,col = revenue,into = c("revenue_value","revenue_unit"),sep = " " )
inc$revenue_value <- as.numeric(inc$revenue_value)
The variable will then be converted to a single unit (millions of dollars).
First, let us make sure that the unit variable really has only two unique values:
unique(inc$revenue_unit)
## [1] "Million" "Billion"
The conditional expression (ifelse) creates a new revenue vector from the values of revenue_value and revenue_unit, multiplying amounts given in billions by 1000:
inc$revenue <- ifelse(test = inc$revenue_unit == "Billion",
yes = inc$revenue_value * 1000,
no = inc$revenue_value)
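The same conversion can be sketched in base R with a lookup table of multipliers instead of ifelse. The vector revenue_raw and the names multiplier and revenue_musd are hypothetical and serve only as an illustration.

```r
# Hypothetical sample in the same "value unit" format as the revenue column.
revenue_raw <- c("36.9 Million", "1.2 Billion", "271.8 Million")

# Split each entry on the single space into its numeric part and its unit.
parts <- strsplit(revenue_raw, " ", fixed = TRUE)
value <- as.numeric(sapply(parts, `[`, 1))
unit  <- sapply(parts, `[`, 2)

# Multipliers that express every unit in millions of dollars.
multiplier <- c(Million = 1, Billion = 1000)

# Indexing the named vector by unit picks the matching multiplier per row.
revenue_musd <- value * unname(multiplier[unit])

revenue_musd
```

With this design an unexpected unit produces NA rather than being silently treated as millions, so a new unit in the data becomes visible immediately.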
Finally, the helper variables are removed:
inc <- select(.data = inc,rank,name,url,state,revenue,growth,industry,
workers,previous_workers,founded,yrs_on_list,metro,city)
str(inc)
## 'data.frame': 5012 obs. of 13 variables:
## $ rank : Factor w/ 4999 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
## $ url : chr "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...
## $ state : Factor w/ 51 levels "AL","AR","AZ",..: 3 43 44 32 38 9 31 46 35 4 ...
## $ revenue : num 36.9 33.6 24.9 32.4 22.5 ...
## $ growth : num 36680 30548 23880 21850 18166 ...
## $ industry : chr "Advertising & Marketing" "Logistics & Transportation" "Food & Beverage" "Consumer Products & Services" ...
## $ workers : int 40 39 190 57 25 742 12 72 60 37 ...
## $ previous_workers: int 5 8 10 2 6 18 1 1 10 5 ...
## $ founded : int 2015 2015 2015 2014 2014 2009 2014 2015 2008 2014 ...
## $ yrs_on_list : int 1 1 1 1 1 1 1 1 1 1 ...
## $ metro : chr "Phoenix" "Nashville" "Austin" NA ...
## $ city : chr "Phoenix" "Brentwood" "Austin" "Albuquerque" ...
Let us now look at the descriptive statistics of the revenue variable:
summary(inc$revenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 4.80 10.40 47.47 27.00 21400.00
cat('\n Standard deviation of the revenue variable:', sd(inc$revenue))
##
## Standard deviation of the revenue variable: 391.3343
cat('\n Skewness of the revenue distribution:', skewness(inc$revenue))
##
## Skewness of the revenue distribution: 39.35647
cat('\n Kurtosis of the revenue distribution:', kurtosis(inc$revenue))
##
## Kurtosis of the revenue distribution: 1931.193
The distribution of the variable is strongly right-skewed and leptokurtic: extreme values occur much more often than they would under a normal distribution.
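To make the two measures concrete, the sketch below computes them by hand from central moments on a small hypothetical sample. It assumes the moment-based definitions that the moments package documents for skewness and kurtosis, under which a normal distribution has skewness 0 and kurtosis 3.

```r
# Hypothetical right-skewed sample with one large value.
x <- c(2, 4, 4, 5, 7, 9, 40)
n <- length(x)

# Central moments of order 2, 3 and 4 (divided by n, not n - 1).
m2 <- sum((x - mean(x))^2) / n
m3 <- sum((x - mean(x))^3) / n
m4 <- sum((x - mean(x))^4) / n

# Skewness: third moment scaled by the standard deviation cubed.
skew_manual <- m3 / m2^1.5

# Kurtosis (not excess kurtosis): fourth moment scaled by the variance squared.
kurt_manual <- m4 / m2^2

c(skewness = skew_manual, kurtosis = kurt_manual)
```

A positive skewness indicates a long right tail, and a kurtosis above 3 indicates heavier tails than the normal distribution, which is the pattern reported for revenue.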
We check the normality of the distribution with the Kolmogorov-Smirnov test:
ks.test(x = inc$revenue,y = "pnorm", mean = mean(inc$revenue), sd = sd(inc$revenue))
## Warning in ks.test.default(x = inc$revenue, y = "pnorm", mean =
## mean(inc$revenue), : wartości powtórzone nie powinny być obecne w teście
## Kolmogorowa-Smirnowa
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: inc$revenue
## D = 0.45375, p-value < 2.2e-16
## alternative hypothesis: two-sided
With a p-value below 2.2e-16 the hypothesis of normality is rejected: the distribution of the revenue variable departs from the normal distribution.
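The statistic D reported by the test is the largest vertical distance between the empirical distribution function of the sample and the reference normal distribution function. The sketch below reproduces it by hand on a small hypothetical sample without ties and compares it with ks.test.

```r
# Hypothetical sample without repeated values.
x <- c(2.0, 4.8, 10.4, 27.0, 47.5, 120.3, 391.0)
n <- length(x)
mu <- mean(x)
sigma <- sd(x)

# Normal CDF evaluated at the sorted observations.
cdf <- pnorm(sort(x), mean = mu, sd = sigma)

# Distance just after each jump of the empirical CDF and just before it.
d_plus  <- max(seq_len(n) / n - cdf)
d_minus <- max(cdf - (seq_len(n) - 1) / n)
d_manual <- max(d_plus, d_minus)

# The same statistic from the built-in test.
d_builtin <- unname(ks.test(x, "pnorm", mean = mu, sd = sigma)$statistic)

c(manual = d_manual, builtin = d_builtin)
```

Because the mean and standard deviation are estimated from the same sample, the p-value of such a test should be read as approximate. The statistic itself is unaffected.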
The growth variable gives the percentage growth over the last three years, according to the information published at:
https://www.inc.com/inc5000/2019/top-private-companies-2019-inc5000.html
In the original INC 5000 list the data are given as percentages with three decimal places (see: https://www.inc.com/inc5000/2019)
The Kaggle dataset stores these values incorrectly as five-digit integers, so a correction is needed:
inc$three_years_growth_percent <- inc$growth/1000
inc$three_years_growth_percent <- round(inc$three_years_growth_percent,digits = 3)
inc <- select(.data = inc,rank,name,url,state,revenue,three_years_growth_percent,
industry,workers,previous_workers,founded,yrs_on_list,metro,city)
industry is the variable that identifies the sector in which the company operates.
Below are the unique values of the variable and the number of occurrences of each.
inc$industry <- as.factor(inc$industry)
freq_table <- table(inc$industry)
sorted_freq_table <- sort(freq_table, decreasing = TRUE)
print(sorted_freq_table)
##
## Business Products & Services Advertising & Marketing
## 492 489
## Software Health
## 461 356
## Construction Consumer Products & Services
## 350 315
## IT Management Financial Services
## 276 239
## Government Services Real Estate
## 236 198
## Logistics & Transportation Manufacturing
## 186 181
## Retail Human Resources
## 163 157
## Food & Beverage IT System Development
## 127 120
## Engineering Telecommunications
## 81 79
## Energy Education
## 78 70
## Insurance Security
## 70 67
## Travel & Hospitality Media
## 57 46
## Environmental Services IT Services
## 43 43
## Computer Hardware
## 32
allindustries <- as.data.frame(sorted_freq_table)
colnames(allindustries) <- c("Industry","Frequency")
industry_plot <- ggplot(allindustries,aes(x =Industry,y = Frequency,fill = Frequency)) +
geom_bar(stat = "identity") + theme_minimal() +
scale_fill_gradient(low = "darkseagreen1",high = "darkseagreen4") +
theme(axis.title= element_blank(),
legend.position = 'none',
axis.text.x = element_text(size = 8, angle = 30, hjust = 1 )) +
labs(title = 'Frequency plot of the industry variable')
ggplotly(industry_plot,tooltip = c("x","y"))
The variable is converted to the factor type (it already is a factor after the earlier as.factor call, so this call leaves it unchanged):
inc$industry <- as.factor(inc$industry)
workers is the variable giving the number of employees:
summary(inc$workers)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 22.0 48.0 242.8 116.0 155000.0 1
The summary shows that the list includes companies with 0 employees:
subset(inc,workers == 0)
## rank name url
## 246 245 Tier4 Group http://tier4group.com
## 1974 1970 Synapse Business Systems synapsebsystems.com
## 4568 4556 IDS International Government Services idsinternational.com
## 4943 4931 Green Mountain Technology greenmountaintechnology.com
## state revenue three_years_growth_percent industry
## 246 GA 4.6 1.729 Business Products & Services
## 1974 VA 3.0 0.204 IT Management
## 4568 VA 44.7 0.064 Government Services
## 4943 TN 21.6 0.054 Logistics & Transportation
## workers previous_workers founded yrs_on_list metro city
## 246 0 3 2010 1 Atlanta Atlanta
## 1974 0 4 2013 1 Washington, DC Fairfax
## 4568 0 682 2006 3 Washington, DC Arlington
## 4943 0 48 1999 3 <NA> Memphis
So that these values do not distort later analyses, we set them to NA, which excludes these records from further analyses of the workers variable:
inc[c(246,1974,4568,4943),"workers"] <- NA
inc[c(246,1974,4568,4943),"workers"]
## [1] NA NA NA NA
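Hard-coded row numbers stop pointing at the right records once rows are filtered or reordered. A minimal sketch of the same replacement driven by a condition is shown below, on a hypothetical data frame toy.

```r
# Hypothetical data with one zero and one value that is already missing.
toy <- data.frame(name = c("A", "B", "C", "D"),
                  workers = c(40L, 0L, NA, 25L))

# which() returns only positions where the comparison is TRUE and drops NA.
zero_rows <- which(toy$workers == 0)

# Mark the implausible zeros as missing.
toy$workers[zero_rows] <- NA

toy$workers
```

Wrapping the comparison in which() matters here: indexing directly with toy$workers == 0 would carry the existing NA into the index.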
previous_workers is the variable giving the number of employees in a previous period. Because that period is not known, the column will be removed from the data.
inc <- select(inc,-previous_workers)
The founded variable contains the year in which the company was founded.
summary(inc$founded)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2003 2009 2005 2012 2016
sort(unique(inc$founded))
## [1] 0 1869 1884 1895 1897 1899 1902 1909 1910 1914 1917 1923 1925 1927 1928
## [16] 1929 1932 1939 1941 1945 1946 1948 1949 1951 1953 1955 1956 1957 1959 1961
## [31] 1962 1963 1964 1965 1967 1968 1969 1970 1972 1973 1974 1975 1976 1977 1978
## [46] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## [61] 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [76] 2009 2010 2011 2012 2013 2014 2015 2016
The summary shows that errors have crept into the data: the INC 2019 list contains a company recorded as founded in year 0. We therefore need to look at the data more closely:
subset(inc,founded == 0)
## rank name url state revenue
## 4726 4714 Nassau National Cable http://nassaunationalcable.com NY 11
## three_years_growth_percent industry workers founded
## 4726 0.06 Business Products & Services 30 0
## yrs_on_list metro city
## 4726 6 New York City GREAT NECK
After a quick check of the history of Nassau National Cable on the internet, I found information suggesting that 1950 can be taken as its founding year.
inc[4726,'founded'] <- 1950
freq_founded <- as.data.frame(table(inc$founded))
colnames(freq_founded) <- c('Founded','Freq')
freq_founded[order(freq_founded$Freq, decreasing = TRUE),]
## Founded Freq
## 81 2014 466
## 79 2012 440
## 80 2013 431
## 78 2011 377
## 76 2009 363
## 77 2010 360
## 75 2008 299
## 74 2007 255
## 73 2006 198
## 71 2004 188
## 72 2005 183
## 82 2015 172
## 70 2003 161
## 68 2001 133
## 69 2002 131
## 67 2000 95
## 66 1999 93
## 64 1997 68
## 63 1996 63
## 65 1998 53
## 62 1995 47
## 61 1994 36
## 60 1993 30
## 56 1989 27
## 57 1990 24
## 59 1992 24
## 58 1991 23
## 52 1985 22
## 54 1987 22
## 49 1982 20
## 55 1988 19
## 53 1986 17
## 45 1978 11
## 50 1983 11
## 51 1984 11
## 43 1976 10
## 46 1979 10
## 47 1980 10
## 48 1981 9
## 37 1969 7
## 40 1973 7
## 44 1977 6
## 20 1946 5
## 34 1965 5
## 41 1974 5
## 27 1956 4
## 38 1970 4
## 39 1972 4
## 42 1975 4
## 22 1949 3
## 25 1953 3
## 29 1959 3
## 31 1962 3
## 12 1925 2
## 14 1928 2
## 18 1941 2
## 19 1945 2
## 26 1955 2
## 33 1964 2
## 35 1967 2
## 1 1869 1
## 2 1884 1
## 3 1895 1
## 4 1897 1
## 5 1899 1
## 6 1902 1
## 7 1909 1
## 8 1910 1
## 9 1914 1
## 10 1917 1
## 11 1923 1
## 13 1927 1
## 15 1929 1
## 16 1932 1
## 17 1939 1
## 21 1948 1
## 23 1950 1
## 24 1951 1
## 28 1957 1
## 30 1961 1
## 32 1963 1
## 36 1968 1
## 83 2016 1
The yrs_on_list variable indicates for how many years the company has appeared in the INC 5000 ranking.
summary(inc$yrs_on_list)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.814 4.000 14.000
inc$yrs_on_list <- as.factor(inc$yrs_on_list)
The variable takes values in the range 1 to 14.
The plot below shows how many companies have spent a given number of years on the list:
freq_yrs_on_list <- as.data.frame(table(inc$yrs_on_list))
colnames(freq_yrs_on_list) <- c('years_on_list', 'frequency')
yrs_on_list_plot <- ggplot(freq_yrs_on_list, aes(x = years_on_list, y = frequency)) +
geom_col(fill = 'darkseagreen4', color = 'black') +
theme_minimal() + labs(title = 'Frequency plot of the yrs_on_list variable',
x = 'Years on the list',
y = 'Frequency')
ggplotly(yrs_on_list_plot)
The metro variable denotes the metropolitan area. Let us verify that metropolitan areas are not repeated (e.g. because of differences in spelling):
sort(unique(inc$metro))
## [1] "Albany-Schenectady-Troy, NY"
## [2] "Allentown-Bethlehem-Easton, PA-NJ"
## [3] "Ann Arbor, MI"
## [4] "Asheville, NC"
## [5] "Atlanta"
## [6] "Austin"
## [7] "Baltimore"
## [8] "Baton Rouge, LA"
## [9] "Birmingham, AL"
## [10] "Boise City-Nampa, ID"
## [11] "Boston"
## [12] "Boulder, CO"
## [13] "Bridgeport-Stamford-Norwalk, CT"
## [14] "Charleston, SC"
## [15] "Charlotte"
## [16] "Chicago"
## [17] "Cincinnati"
## [18] "Cleveland"
## [19] "Columbia, SC"
## [20] "Columbus, OH"
## [21] "Dallas"
## [22] "Denver"
## [23] "Des Moines, IA"
## [24] "Detroit"
## [25] "Durham, NC"
## [26] "Greenville-Mauldin-Easley, SC"
## [27] "Houston"
## [28] "Huntsville, AL"
## [29] "Indianapolis, IN"
## [30] "Inland Empire, CA"
## [31] "Jacksonville, FL"
## [32] "Kansas City, MO-KS"
## [33] "Lancaster, PA"
## [34] "Las Vegas, NV"
## [35] "Los Angeles"
## [36] "Louisville/Jefferson County, KY-IN"
## [37] "Madison, WI"
## [38] "Miami"
## [39] "Milwaukee"
## [40] "Minneapolis"
## [41] "Nashville"
## [42] "New Orleans"
## [43] "New York City"
## [44] "Ogden-Clearfield, UT"
## [45] "Oklahoma City, OK"
## [46] "Omaha-Council Bluffs, NE-IA"
## [47] "Orlando, FL"
## [48] "Oxnard-Thousand Oaks-Ventura, CA"
## [49] "Philadelphia"
## [50] "Phoenix"
## [51] "Pittsburgh, PA"
## [52] "Provo-Orem, UT"
## [53] "Raleigh, NC"
## [54] "Richmond, VA"
## [55] "Rochester, NY"
## [56] "Sacramento, CA"
## [57] "Salt Lake City"
## [58] "San Antonio, TX"
## [59] "San Diego"
## [60] "San Francisco"
## [61] "San Jose"
## [62] "Santa Barbara-Santa Maria-Goleta, CA"
## [63] "Sarasota-Bradenton-Venice, FL"
## [64] "Seattle"
## [65] "Springfield, MO"
## [66] "St. Louis, MO-IL"
## [67] "Tampa"
## [68] "Tulsa, OK"
## [69] "Virginia Beach"
## [70] "Washington, DC"
Some metropolitan areas have a state abbreviation appended after a comma. It will not be needed in further analysis, because it is already contained in the state variable, so it will be removed from the variable.
inc$metro <- sub(',.*','',inc$metro)
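The pattern ',.*' matches the first comma and everything after it, so sub keeps only the text before the comma. A short sketch on a hypothetical vector metro_raw, whose values are taken from the list above, shows the effect, including on missing values.

```r
# Sample values copied from the metro listing, plus one missing value.
metro_raw <- c("Phoenix",
               "Kansas City, MO-KS",
               "Louisville/Jefferson County, KY-IN",
               NA)

# Remove the first comma and everything that follows it.
metro_clean <- sub(",.*", "", metro_raw)

metro_clean
```

Entries without a comma and missing values pass through unchanged. The same rule shortens "Washington, DC" to "Washington".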
The city variable contains information about the city.
Let us inspect its unique values to catch any repetitions caused by differences in spelling:
sort(unique(inc$city))
## [1] ":Livermore" "Aberdeen" "Abilene"
## [4] "Acton" "Ada" "Addison"
## [7] "Agoura Hills" "Ahaheim" "Akron"
## [10] "Alameda" "Albany" "ALBANY"
## [13] "Albuquerque" "ALBUQUERQUE" "Alexandria"
## [16] "Aliso Viejo" "Allegan" "Allen"
## [19] "ALLEN" "Allendale" "Allentown"
## [22] "Alpena" "Alpharetta" "ALPHARETTA"
## [25] "Altamonte Springs" "Alton" "Amarillo"
## [28] "Ambler" "Ambridge" "American Fork"
## [31] "Amherst" "anaheim" "Anaheim"
## [34] "ANAHEIM" "Andover" "Ann Arbor"
## [37] "Annandale" "Annapolis" "Annapolis Junction"
## [40] "Apex" "Apollo Beach" "Arbutus"
## [43] "Argyle" "Arlignton" "Arlington"
## [46] "Arlington Heights" "Arvada" "Arvin"
## [49] "Asbury Park" "Ashburn" "ASHBURN"
## [52] "Asheville" "Ashland" "athens"
## [55] "Athens" "Atlanta" "Auburn"
## [58] "Auburn Hills" "Auburndale" "Augusta"
## [61] "Aurora" "AURORA" "Austell"
## [64] "Austin" "Aventura" "Avon"
## [67] "Avondale" "Bakersfield" "Bala Cynwyd"
## [70] "Ballwin" "Baltimore" "Bannockburn"
## [73] "Baraboo" "Barberton" "Barnesville"
## [76] "Bartlett" "Bartonville" "Batavia"
## [79] "Baton Rouge" "Battlefield" "bay shore"
## [82] "Bay Shore" "Bayside" "Beaufort"
## [85] "Beavercreek" "Beaverton" "Bedford"
## [88] "Bedford Hts" "Bedminster" "Bee Cave"
## [91] "Bee Caves" "Bell Canyon" "Belleville"
## [94] "BELLEVILLE" "Bellevue" "Bellingham"
## [97] "Bellmawr" "Belmont" "Beltsville"
## [100] "Bend" "BEND" "Bennington"
## [103] "Bensalem" "Berkeley" "Berkeley Springs"
## [106] "Berwyn" "Bessemer" "Bethel"
## [109] "Bethesda" "Bethlehem" "Beverly"
## [112] "Beverly Hills" "Billings" "bingham farms"
## [115] "Birmimgham" "Birmingham" "Bismarck"
## [118] "Blaine" "Blasdell" "Blauvelt"
## [121] "BLOOMFIELD" "Bloomfield Hills" "Bloomingdale"
## [124] "Bloomington" "Bloomsburg" "Blue Ash"
## [127] "Blue Bell" "Blue Mounds" "Blue Springs"
## [130] "Bluffdale" "Bluffton" "Boca Raton"
## [133] "Bohemia" "Boise" "Bolingbrook"
## [136] "Bon Aqua" "Boothwyn" "Boston"
## [139] "BOSTON" "Bothell" "Boulder"
## [142] "Bountiful" "Bowie" "Boynton Beach"
## [145] "Bozeman" "Bradenton" "BRADENTON"
## [148] "Braintree" "BRAINTREE" "Branchburg"
## [151] "Brandon" "Brea" "Breckenridge"
## [154] "Brecksville" "Brentwood" "Brick"
## [157] "Bridgeport" "Brisbane" "Brockton"
## [160] "Bronx" "Brookfield" "Brookhaven"
## [163] "Brooklyn" "Brooklyn Park" "Brookshire"
## [166] "Brookston" "Broomfield" "Brossard"
## [169] "Brunswick" "Buellton" "Buena Park"
## [172] "Buffalo" "Buffalo Grove" "Buford"
## [175] "BUFORD" "Buhler" "Burbank"
## [178] "BURBANK" "Burlingame" "Burlington"
## [181] "Burnsville" "Burr Ridge" "Burtonsville"
## [184] "Butler" "CA" "Calabasas"
## [187] "CALHOUN" "California" "Camarillo"
## [190] "Camas" "Camden" "Campbell"
## [193] "Campbell Hall" "Campbellsport" "Canoga Park"
## [196] "Canton" "Cape Canaveral" "Carbon Hill"
## [199] "Caribou" "Carlsbad" "Carmel"
## [202] "CARMEL" "Carrollton" "Carson"
## [205] "Cary" "Casselberry" "Castle Pines"
## [208] "Castle Rock" "Catlett" "Cedar Falls"
## [211] "Cedar Park" "CEDAR PARK" "Cedar Rapids"
## [214] "Centennial" "CENTENNIAL" "Center Point"
## [217] "Centerville" "Cerritos" "Chadds Ford"
## [220] "Chambersburg" "Chandler" "Chandler,"
## [223] "Chantilly" "CHANTILLY" "Chapel Hill"
## [226] "Charleston" "Charlotte" "Charlottesville"
## [229] "chatsworth" "Chatsworth" "Chattanooga"
## [232] "CHATTANOOGA" "Chelmsford" "Cher"
## [235] "Cherry Hill" "Chesapeake" "Chester"
## [238] "Chesterfield" "Chesterton" "Chestertown"
## [241] "Chevy Chase" "Cheyenne" "Chicago"
## [244] "Chicago Heights" "Chicao" "Chico"
## [247] "Chino" "Chino Hills" "Chula Vista"
## [250] "Cincinnati" "CINCINNATI" "Cincinnati, OH"
## [253] "Cincinnnati" "City of Industry" "Clarksburg"
## [256] "Claymont" "Clayton" "clearwater"
## [259] "Clearwater" "CLEARWATER" "Cleveland"
## [262] "Clifton Park" "Clinton Township" "Clive"
## [265] "Clovis" "Cockeysville" "Coconut Creek"
## [268] "Coconut Grove" "Coeur D Alene" "College Grove"
## [271] "College Station" "COLLEGEVILLE" "Colleyville"
## [274] "Collierville" "Colonial Beach" "Colonial Heights"
## [277] "COLONIAL HEIGHTS" "Colorad Springs" "colorado springs"
## [280] "Colorado Springs" "Columbia" "Columbus"
## [283] "Comfort" "Commerce" "Commerce Twp"
## [286] "concord" "Concord" "Conroe"
## [289] "Conshohocken" "Conway" "Copley"
## [292] "Coppell" "Coral Gables" "Coral Springs"
## [295] "CORAL SPRINGS" "Coralville" "Corona"
## [298] "Corpus Christi" "Cortland" "Costa Mesa"
## [301] "Coto de Caza" "COTTONWD HTS" "Covingtom"
## [304] "Covington" "Cranbury" "Crestview"
## [307] "Crestwood" "Crystal" "Crystal Lake"
## [310] "Culver City" "Cumming" "Cypress"
## [313] "Dahlonega" "Dakota Dunes" "Dallas"
## [316] "Danbury" "Dane" "Dania Beach"
## [319] "Danvers" "DANVERS" "Danville"
## [322] "Daphne" "DAPHNE" "Darien"
## [325] "Davidson" "Davidsonville" "Davie"
## [328] "DAVIS" "Dayton" "De Pere"
## [331] "Decatur" "Deer Park" "DEER PARK"
## [334] "Deerfield" "Deerfield Beach" "Defiance"
## [337] "Del City" "Del Mar" "Delafield"
## [340] "Delavan" "delray beach" "Delray Beach"
## [343] "Denton" "Denver" "Derwood"
## [346] "Des Moines" "Des Plaines" "Destin"
## [349] "Detroit" "DETROIT LAKES" "Dexter"
## [352] "Diamond Bar" "doral" "Doral"
## [355] "Dover" "Downers Grove" "Downingtown"
## [358] "Doylestown" "Draper" "DRAPER"
## [361] "Drexel" "Drexel Hill" "Dripping Springs"
## [364] "Duarte" "Dublin" "Dubuque"
## [367] "Dulles" "Duluth" "Dumfries"
## [370] "Dunbar" "Dundee" "Dunwoody"
## [373] "Durango" "durham" "Durham"
## [376] "DURHAM" "eagan" "Eagan"
## [379] "Eagle" "Eagle Mountain" "East Boston"
## [382] "East Brunswick" "East Flat Rock" "East Hampton"
## [385] "East Providence" "Eden" "Eden Prairie"
## [388] "Eden Prarie" "Edgewater" "Edgewood"
## [391] "Edina" "Edison" "Edmond"
## [394] "EDMOND" "Effingham" "Eighty Four"
## [397] "El Cajon" "El Dorado Hills" "EL PASO"
## [400] "El Segundo" "ELDERSBURG" "Elgin"
## [403] "Elk city" "Elk Grove Village" "Elkhart"
## [406] "Elkridge" "ELLICOTT CITY" "Elm Grove"
## [409] "elmhurst" "Elmhurst" "Elmsford"
## [412] "Elmwood Park" "Ely" "Emerald Isle"
## [415] "Emeryville" "Emmaus" "EMMAUS"
## [418] "Encinitas" "Encinitass" "Encino"
## [421] "Englewood" "Erie" "Erlanger"
## [424] "Escondido" "Eugene" "Euless"
## [427] "Evans" "Evanston" "Evergreen"
## [430] "Ewa Beach" "EWING" "Exeter"
## [433] "Exton" "Fairfax" "Fairfield"
## [436] "Fairhope" "Fairlawn" "Fall River"
## [439] "falls church" "Falls Church" "FALLS CHURCH"
## [442] "Fargo" "Farmers Branch" "Farmington"
## [445] "Farmington Hills" "fayetteville" "Fayetteville"
## [448] "FAYETTEVILLE" "FELTON" "Fenton"
## [451] "Ferndale" "Fishers" "Fletcher"
## [454] "Flint" "Florida" "Flowood"
## [457] "Flushing" "Folsom" "Fontana"
## [460] "Foothill Ranch" "Forest Hill" "Forked River"
## [463] "Fort Collins" "Fort Lauderdale" "Fort Lee"
## [466] "Fort Mill" "Fort Myers" "Fort Pierce"
## [469] "Fort Smith" "Fort Walton Beach" "Fort Washington"
## [472] "Fort Wayne" "Fort Worth" "Foster City"
## [475] "Fountain Valley" "Fox Lake" "Foxborough"
## [478] "Framingham" "FRANKFORD" "Franklin"
## [481] "Frederick" "Fredericksburg" "Freeburg"
## [484] "FREEHOLD" "Fremont" "Fresh Meadows"
## [487] "fresno" "Fresno" "FRESNO"
## [490] "Frisco" "Frontenac" "Fruita"
## [493] "Ft Collins" "Ft Worth" "Ft. Lauderdale"
## [496] "fullerton" "Fullerton" "Fulton"
## [499] "Fuquay-Varina" "Gainesville" "Gaithersburg"
## [502] "Garden City" "Garden Grove" "GARLAND"
## [505] "Garnet Valley" "Gastonia" "Geneseo"
## [508] "Genoa City" "Georgetown" "Germantown"
## [511] "Gibson" "Giddings" "Gig Harbor"
## [514] "Gilbert" "Gillette" "Glen Allen"
## [517] "Glen Burnie" "Glendale" "GLENDALE"
## [520] "Glendale Heights" "Glenview Nas" "Gold River"
## [523] "Golden" "GOLDEN VALLEY" "Goleta"
## [526] "Grafton" "Grain Valley" "Grand Junction"
## [529] "Grand Prairie" "Grand Rapids" "GRAND RAPIDS"
## [532] "Grandview" "Granite Bay" "Granite Falls"
## [535] "Grants Pass" "Granville" "Grapevine"
## [538] "Grass Valley" "GREAT NECK" "Green Bay"
## [541] "GREEN BROOK" "Greenacres" "Greenbelt"
## [544] "Greenland" "Greensboro" "Greenville"
## [547] "GREENVILLE" "Greenwich" "Greenwood"
## [550] "Greenwood Village" "Greer" "Grover Beach"
## [553] "Guilderland Center" "hacienda heights" "Hackensack"
## [556] "Hackettstown" "Haddonfield" "Hainesport"
## [559] "Halfmoon" "Haltom city" "Haltom City"
## [562] "Hamburg" "Hamden" "Hamilton"
## [565] "HAMILTON" "Hampton" "Hanahan"
## [568] "Hanover" "Harrisburg" "Harrisonburg"
## [571] "Hartford" "Hartselle" "Hauppauge"
## [574] "HAUPPAUGE" "Haverhill" "Havre de Grace"
## [577] "Hayden" "Hayward" "Heath"
## [580] "Henderson" "Hendersonville" "Henrico"
## [583] "Hermosa Beach" "Herndon" "Hialeah"
## [586] "Hiawatha" "Hickory" "Hicksville"
## [589] "High Bridge" "High Point" "highland"
## [592] "Highland Park" "Highlands Ranch" "Hilliard"
## [595] "Hillsboro" "Hillsborough" "HILLSIDE"
## [598] "Hingham" "Hinsdale" "hixson"
## [601] "Hixson" "Hoboken" "HOBOKEN"
## [604] "Hoffman Estates" "Holladay" "Holland"
## [607] "Holly Springs" "Hollywood" "Holmdel"
## [610] "HOLMEN" "Holmes" "Honesdale"
## [613] "Honolulu" "Hoover" "Hope Mills"
## [616] "Hopkinton" "Hot Springs" "Houston"
## [619] "Howell" "HOWELL" "Hudson"
## [622] "Hudson Oaks" "Hudsonville" "Humble"
## [625] "Hunt Valley" "Huntersville" "Huntington Beach"
## [628] "HUNTINGTON BEACH" "Huntsville" "HUNTSVILLE"
## [631] "Hyattsville" "Idaho Falls" "Ijamsville"
## [634] "Independence" "Indian Trail" "Indianapolis"
## [637] "Indio" "Iowa City" "Irvine"
## [640] "IRVINE" "Irving" "IRVING"
## [643] "Irving Texas" "Iselin" "Islandia"
## [646] "ISSAQUAH" "Itasca" "Jackson"
## [649] "Jacksonville" "Jacksonville Beach" "JACKSONVILLE BEACH"
## [652] "Jamul" "Jenks" "Jericho"
## [655] "Jersey City" "Johns Creek" "Johnston"
## [658] "Jordan" "Jupiter" "Kalispell"
## [661] "Kansas City" "Kasyville" "Katy"
## [664] "Kaukauna" "Kaysville" "Kearny"
## [667] "Keller" "kenilworth" "Kenilworth"
## [670] "kennesaw" "Kennesaw" "kenosha"
## [673] "Kenosha" "kent" "Kent"
## [676] "Kentwood" "Kernersville" "Key West"
## [679] "Killeen" "King" "King of Prussia"
## [682] "Kingman" "Kingston" "Kingwood"
## [685] "Kirkland" "KIRKLAND" "Kitty Hawk"
## [688] "Knightstown" "Knoxville" "Kodak"
## [691] "Kutztown" "La Crescenta" "La Grange"
## [694] "La Jolla" "La Mirada" "La Porte"
## [697] "La Quinta" "Lacey" "Laconia"
## [700] "lafayette" "Lafayette" "Lagrange"
## [703] "LaGrange" "Laguna Hills" "Laguna Niguel"
## [706] "Lahaina" "Lake Balboa" "Lake City"
## [709] "LAKE CITY" "Lake Elsinore" "Lake Forest"
## [712] "Lake Havasu City" "Lake in the Hills" "Lake Mary"
## [715] "Lake Orion" "Lake Oswego" "Lake Success"
## [718] "Lakeland" "Lakeville" "Lakeway"
## [721] "Lakewood" "LAKEWOOD" "Lakewood Ranch"
## [724] "Lambertville" "Lancaster" "Land O' Lakes"
## [727] "Landover" "Lansdale" "Lansing"
## [730] "Larchmont" "Largo" "LARGO"
## [733] "Las Vegas" "LaSalle" "Latham"
## [736] "Laurie" "Lawndale" "Lawrenceville"
## [739] "Lawton" "Layton" "Laytonsville"
## [742] "League City" "Leawood" "Lebanon"
## [745] "LEE" "Lee's Summit" "Leesburg"
## [748] "LEESBURG" "Lehi" "LEHI"
## [751] "Lehighton" "Lemont" "Lenexa"
## [754] "Lenox" "Lewiston" "Lexington"
## [757] "Lexington Park" "LEXINGTON PARK" "Liberty"
## [760] "LIBERTY LAKE" "Lighthouse Point" "Lilburn"
## [763] "Lima" "Lincoln" "Lincoln City"
## [766] "Lincolnshire" "Lindon" "Linthicum, MD 21090"
## [769] "Lisle" "Lititz" "LITTLE CHUTE"
## [772] "Little Rock" "Littleton" "Livermore"
## [775] "Livingston" "Livonia" "Lombard"
## [778] "London" "Long Beach" "Long Island City"
## [781] "Longmont" "Longwood" "Lorton"
## [784] "Los Alamitos" "Los Angeles" "Los Angeles, CA 90036"
## [787] "Los Gatos" "Louisiana" "Louisville"
## [790] "LOUISVILLE" "Louisville KY" "Loveland"
## [793] "Lowell" "Lubbock" "Lumberton"
## [796] "Luray" "Lutz" "Lynnwood"
## [799] "Macon" "Madison" "Madisonville"
## [802] "Mahwah" "Malibu" "Malvern"
## [805] "Manahawkin" "Manalapan" "manassas"
## [808] "Manassas" "MANASSAS" "Manchester"
## [811] "Mandeville" "Manhattan" "Manhattan Beach"
## [814] "Manitowoc" "Mankato" "Mansfield"
## [817] "Maple Grove" "marietta" "Marietta"
## [820] "Marina Del Rey" "Marlborough" "Marlton"
## [823] "Martinez" "Mason" "Massapequa"
## [826] "Matthews" "Maumee" "Mayfield Heights"
## [829] "McAllen" "McDonough" "McFarland"
## [832] "McKinney" "Mclean" "McLean"
## [835] "McMurray" "Meadow Vista" "Mechanicsville"
## [838] "Medford" "MEDFORD" "Media"
## [841] "Medina" "Melbourne" "Melville"
## [844] "MELVILLE" "memphis" "Memphis"
## [847] "MENDOTA HEIGHTS" "MENTOR" "Meridian"
## [850] "Mesa" "MESA" "Metairie"
## [853] "Miami" "Miami Beach" "MIAMI BEACH"
## [856] "Miami Lakes" "Miamisburg" "Michigan City"
## [859] "Middle River" "Middletown" "Midland"
## [862] "Midlothian" "Midvale" "Midwest City"
## [865] "Milford" "Milledgeville" "Millersville"
## [868] "Milpitas" "Milton" "Milwaukee"
## [871] "Milwaukie" "Minneapolis" "MINNEAPOLIS"
## [874] "Minnetonka" "MIRAMAR" "Mishawaka"
## [877] "Mission" "Mission Viejo" "Missoula"
## [880] "Mobile" "Modesto" "Mogadore"
## [883] "Mohnton" "Mokena" "Monroe"
## [886] "Monroeville" "Monsey" "Montgomery"
## [889] "Moorestown" "Mooresville" "Moorpark"
## [892] "Morgantown" "Morristown" "Morrisville"
## [895] "MORRISVILLE" "Mount Dora" "Mount Laurel"
## [898] "mount pleasant" "Mount Pleasant" "Mount Vernon"
## [901] "Mountain Top" "Mountain View" "Mountainside"
## [904] "Mt Laurel" "Mt Pleasant" "Mt Washington"
## [907] "Mt. Holly" "Mt. Pleasant" "Mt. Vernon"
## [910] "Murray" "MURRELLS INLET" "Murrieta"
## [913] "MURRIETA" "N. Huntingdon" "Nampa"
## [916] "Naperrville" "Naperville" "NAPERVILLE"
## [919] "Naples" "Nappanee" "Narberth"
## [922] "Nashua" "Nashville" "NASHVILLE"
## [925] "Natick" "Navarre" "Nazareth"
## [928] "Needham" "Neenah" "Nesconset"
## [931] "Nevada City" "New Albany" "New Bedford"
## [934] "New Braunfels" "New Haven" "New Melle"
## [937] "New Milford" "New Orleans" "NEW ORLEANS"
## [940] "New Port Richey" "New Rochelle" "NEW ULM"
## [943] "New York" "New York city" "New York City"
## [946] "Newark" "Newburgh" "Newburyport"
## [949] "Newhall" "Newnan" "Newport"
## [952] "NEWPORT" "Newport Beach" "Newport News"
## [955] "Newton" "NEWTON CENTER" "Newtown"
## [958] "Newtown Square" "Nicholasville" "NJ"
## [961] "Noblesville" "Nokomis" "Norcross"
## [964] "NORCROSS" "Norfolk" "Norman"
## [967] "Norristown" "North Andover" "North Billerica"
## [970] "North Brunswick" "North Charleston" "North Hollywood"
## [973] "North Kansas City" "North Las Vegas" "North Liberty"
## [976] "North Miami Beach" "North Myrtle Beach" "North Riverside"
## [979] "North Salt Lake" "North Smithfield" "North Venice"
## [982] "Northampton" "Northbrook" "NORTHBROOK"
## [985] "Northport" "norwalk" "Norwalk"
## [988] "Norwood" "Novato" "Novelty"
## [991] "Novi" "Nutley" "O'Fallon"
## [994] "Oak Brook" "Oak Park" "Oak Ridge"
## [997] "Oakbrook Terrace" "Oakdale" "Oakland"
## [1000] "Oakland Park" "Oakwood Village" "Ocala"
## [1003] "Ocean" "Ocean City" "Oceanside"
## [1006] "Odessa" "Ofallon" "Ogden"
## [1009] "Oklahoma City" "Olathe" "Oldsmar"
## [1012] "Olivette" "Olympia" "Omaha"
## [1015] "OMAHA" "Ontario" "Orange"
## [1018] "Orange Park" "Orem" "orlando"
## [1021] "Orlando" "Orrville" "Oshkosh"
## [1024] "Overland Park" "OVERLAND PARK" "Oviedo"
## [1027] "Owings Mills" "Owosso" "OWOSSO"
## [1030] "Oxford" "Oxnard" "Pacifica"
## [1033] "Palatine" "Palm Bay" "Palm Desert"
## [1036] "Palm Harbor" "PALM HARBOR" "Palmer"
## [1039] "Palo Alto" "Paramus" "Park City"
## [1042] "Park CIty" "Parker" "Parsippany"
## [1045] "Parsippany-Troy Hills" "Pasadena" "PASADENA"
## [1048] "Paso Robles" "Peabody" "Peachtree Corners"
## [1051] "Pearland" "Pembroke Pines" "PENN VALLEY"
## [1054] "Pennsauken" "Pensacola" "PENSACOLA"
## [1057] "Pensacola, FL" "Perry Hall" "Perrysburg"
## [1060] "Petoskey" "Petroleum" "Pewaukee"
## [1063] "Pflugerville" "Philadelphia" "Philadephia"
## [1066] "Philipsburg" "Phoenix" "PHOENIX"
## [1069] "Pikesville" "Pine Brook" "Piscataway"
## [1072] "Pittsburgh" "Pittsford" "Plain City"
## [1075] "Plainfield" "Plainsboro" "Plainview"
## [1078] "Plano" "Plant City" "Plantation"
## [1081] "Pleasant Grove" "Pleasanton" "Pleasantville"
## [1084] "Please Select" "plymouth" "Plymouth"
## [1087] "Plymouth Meeting" "Pompano Beach" "Port Huron"
## [1090] "Port Vincent" "Port Washington" "PORTAGE"
## [1093] "Porter" "Porter Ranch" "Portersville"
## [1096] "Portland" "Portsmouth" "PORTSMOUTH"
## [1099] "Poseyville" "Post Falls" "POST FALLS"
## [1102] "Potomac" "Pottstown" "Poway"
## [1105] "Powell" "Prattville" "Princeton"
## [1108] "Prospect" "Provo" "Pueblo"
## [1111] "Purcellville" "puyallup" "Quakertown"
## [1114] "Queensbury" "Quincy" "Racine"
## [1117] "Radnor" "Raleigh" "Ramsey"
## [1120] "Rancho Cordova" "Rancho Cucamonga" "Rancho Santa Fe"
## [1123] "Ranson" "Rapid City" "Reading"
## [1126] "Red Bank" "redmond" "Redmond"
## [1129] "Redondo Beach" "Redwood City" "Redwood Shores"
## [1132] "Reno" "RENO" "Renson"
## [1135] "Renton" "Reston" "Rhinebeck"
## [1138] "Rhome" "Richardson" "Richfield"
## [1141] "Richland" "richmond" "Richmond"
## [1144] "Ridgeland" "River Falls" "River Heights"
## [1147] "Riverside" "RIVERSIDE" "Riverton"
## [1150] "Roanoke" "Rochelle Park" "Rochester"
## [1153] "ROCHESTER" "Rock Hill" "Rock Island"
## [1156] "Rockford" "Rockland" "Rocklin"
## [1159] "Rockville" "ROCKVILLE" "Rockwall"
## [1162] "rocky river" "Rocky River" "Rolling Meadows"
## [1165] "RONKONKOMA" "Roseburg" "Roselle"
## [1168] "ROSEMEAD" "Rosemont" "Roseville"
## [1171] "Rosharon" "Rosslyn" "Roswell"
## [1174] "Round Rock" "Roy" "Royal Oak"
## [1177] "ROYAL OAK" "Ruston" "Rutherfordton"
## [1180] "Sacramento" "Safety Harbor" "SAFETY HARBOR"
## [1183] "Saint Augustine" "Saint Charles" "Saint George"
## [1186] "Saint Louis" "Saint Louis Park" "Saint Paul"
## [1189] "Saint Peters" "Saint Petersburg" "SAINT PETERSBURG"
## [1192] "Salem" "Saline" "Salisbury"
## [1195] "Salt Lake City" "san antonio" "San Antonio"
## [1198] "SAN ANTONIO" "San Carlos" "SAN CLEMENTE"
## [1201] "San Diego" "San Francisco" "SAN GABRIEL"
## [1204] "San Jose" "SAN JOSE" "San Juan"
## [1207] "San Juan Capistrano" "San Luis Obispo" "San Marcos"
## [1210] "San Marino" "San Mateo" "San Rafael"
## [1213] "San Ramon" "Sandusky" "Sandy"
## [1216] "Sandy Springs," "sanford" "Sanford"
## [1219] "SANFORD" "Santa Ana" "SANTA ANA"
## [1222] "Santa Barbara" "Santa Clara" "Santa Clarita"
## [1225] "Santa Cruz" "SANTA CRUZ" "Santa Fe Springs"
## [1228] "Santa Monica" "Santa Rosa" "Santa Rosa Beach"
## [1231] "Santee" "Saranac" "Sarasota"
## [1234] "SARASOTA" "Saratoga Springs" "Sausalito"
## [1237] "Savannah" "Schaumberg" "Schenectady"
## [1240] "Scott AFB" "scottsdale" "Scottsdale"
## [1243] "Seattle" "Sedona" "Seven Hills"
## [1246] "Severna Park" "Sevierville" "Sewickley"
## [1249] "SF" "Shaker Heights" "Sharon"
## [1252] "SHAWNEE" "Sheridan" "Sherman Oaks"
## [1255] "Shoreview" "Silver Spring" "Silverado"
## [1258] "Simi Valley" "Simpsonville" "Sinking Spring"
## [1261] "Sioux City" "Sioux Falls" "Skillman"
## [1264] "skokie" "Skokie" "Smithtown"
## [1267] "Smock" "Solana Beach" "Solana Bech"
## [1270] "Solon" "Somerset" "Somerville"
## [1273] "Sonoma" "South Bend" "South Coast Metro"
## [1276] "South El Monte" "South Hackensack" "South Holland"
## [1279] "SOUTH HOLLAND" "South Jordan" "South Londonderry"
## [1282] "South Miami" "South Plainfield" "South River"
## [1285] "South Salt Lake" "South San Francisco" "Southampton"
## [1288] "Southborough" "Southfield" "Southgate"
## [1291] "Southlake" "Spanish Fork" "Spanish Fort"
## [1294] "Spartanburg" "Spicewood" "Spokane"
## [1297] "Spokane Valley" "Spring" "Springfield"
## [1300] "St Louis" "ST LOUIS" "St Louis Park"
## [1303] "St Paul" "St Peters" "St Petersburg"
## [1306] "ST PETERSBURG" "St. Augustine" "St. Charles"
## [1309] "St. George" "St. James" "St. Louis"
## [1312] "St. Louis MO" "St. Louis Park" "St. Paul"
## [1315] "St. Petersburg" "St. Rose" "Stafford"
## [1318] "Stamford" "STANTON" "State College"
## [1321] "Statesville" "Steamboat Springs" "Sterling"
## [1324] "STEVENSON RANCH" "Stevensville" "Stillwater"
## [1327] "Stoneham" "Stow" "Stratford"
## [1330] "Strongsville" "Stuart" "Studio City"
## [1333] "sturgis" "Suffern" "Sugar Land"
## [1336] "Sugarland" "Suisun City" "Sulphur"
## [1339] "Summit" "Sumner" "Sun Prairie"
## [1342] "Sun Valley" "Sunnyvale" "Suwanee"
## [1345] "Swannnanoa" "Swanton" "Sykesville"
## [1348] "Syracuse" "Tabor CIty" "Tacoma"
## [1351] "TACOMA" "Tallahassee" "Tampa"
## [1354] "Taylor" "Taylorsville" "Taylorville"
## [1357] "Tea" "Teaneck" "Temecula"
## [1360] "tempe" "Tempe" "Temple"
## [1363] "Temple Terrace" "Terre Haute" "The Villages"
## [1366] "The Woodlands" "Thomasville" "Thornton"
## [1369] "Thousand Oaks" "Thousand Palms" "Tigard"
## [1372] "TIGARD" "Tinley Park" "Tinton Falls"
## [1375] "Toledo" "TOLEDO" "tomball"
## [1378] "Tomball" "TOMBALL" "Toms River"
## [1381] "Topanga" "Topeka" "Topsfield"
## [1384] "Torrance" "Torrington" "Towson"
## [1387] "TOWSON" "Travelers Rest" "Trenton"
## [1390] "Trevose" "troy" "Troy"
## [1393] "TROY" "Truckee" "Tucker"
## [1396] "tucson" "Tucson" "tulsa"
## [1399] "Tulsa" "Tumwater" "Turlock"
## [1402] "Turnersville" "Tuscaloosa" "Tustin"
## [1405] "Twin Falls" "Tyler" "TYLER"
## [1408] "Tyrone" "Tysons" "Tysons Corner"
## [1411] "Union" "Uniontown" "Upper Nyack"
## [1414] "Urbana" "Urbandale" "Valencia"
## [1417] "Valhalla" "valley cottage" "Valley Stream"
## [1420] "Valley View" "Van Nuys" "Vancouver"
## [1423] "VANCOUVER" "Venice" "Venice Beach"
## [1426] "Vernon" "Vernon Hills" "VERNON HILLS"
## [1429] "Vero Beach" "Victoria" "Vidalia"
## [1432] "Vienna" "VIenna" "Vineland"
## [1435] "Virginia Beach" "Visalia" "Vista"
## [1438] "Voorhees Township" "Waco" "Waconia"
## [1441] "wagoner" "Wakefield" "wall"
## [1444] "Wall" "Waller" "Wallingford"
## [1447] "Walnut" "Walnut Creek" "Waltham"
## [1450] "Warner Robins" "Warren" "Warrendale"
## [1453] "Warwick" "Wash" "Washington"
## [1456] "Washington D.C." "Washington, D.C." "Washington, DC"
## [1459] "waterbury" "Watertown" "Waterville"
## [1462] "Wauconda" "Waukee" "Waukesha"
## [1465] "Waunakee" "Wausau" "Waxhaw"
## [1468] "Wayne" "Weatherford" "Webster"
## [1471] "Wenatchee" "West Babylon" "West Bend"
## [1474] "West Bloomfield" "West Chester" "WEST CHESTER"
## [1477] "West Columbia" "West Des Moines" "West Fargo"
## [1480] "West Hartford" "West Haven" "West Henrietta"
## [1483] "West Hollywood" "West Jordan" "WEST LINN"
## [1486] "West Long Branch" "West Memphis" "West Newbury"
## [1489] "West Palm Beach" "West Point" "West Springfield"
## [1492] "Westborough" "Westerville" "Westford"
## [1495] "Westlake" "WESTLAKE" "Westlake Village"
## [1498] "WESTLAKE VILLAGE" "Westland" "Westminster"
## [1501] "Weston" "Westport" "Westville"
## [1504] "Westwood" "Wexford" "Wharton"
## [1507] "Wheat Ridge" "White Plains" "Whitinsville"
## [1510] "Wichita" "Williamston" "Willoughby"
## [1513] "Willow Park" "Wilmington" "Wilmington, MA"
## [1516] "Wilminton" "Wilsonville" "Wimberley"
## [1519] "Winchester" "winder" "Windham"
## [1522] "Windham, NH" "Windsor" "Windsor Mill"
## [1525] "Winooski" "Winston Salem" "Winter Garden"
## [1528] "WINTER GARDEN" "Winter Haven" "Winter Park"
## [1531] "WINTER PARK" "Woburn" "Woodbine"
## [1534] "Woodbridge" "Woodbury" "Woodcliff Lake"
## [1537] "Woodinville" "WOODINVILLE" "Woodland Hills"
## [1540] "WOODLAND HILLS" "Woods Cross" "Woodstock"
## [1543] "Worcester" "Worthington" "Wrightsville Beach"
## [1546] "Wylie" "Wyncote" "Wyoming"
## [1549] "Wyomissing" "Yakima" "Yardley"
## [1552] "Yonkers" "Yorba Linda" "York"
## [1555] "YORKTOWN" "Zelienople" "Zephyrhills"
## [1558] "Zionsville"
The city variable contains values that only appear to be unique: some cities are recorded in several different ways (for example, with inconsistent capitalization or with typos).
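Case-only variants can be surfaced without manual inspection. The following is a minimal sketch in base R, assuming a small hypothetical vector `cities` that stands in for `unique(inc$city)`; the names `variants` and `case_variants` are introduced here for illustration only.

```r
# Hypothetical sample; in the report this would be unique(inc$city).
cities <- c("Manassas", "MANASSAS", "memphis", "Memphis", "Mesa", "MESA", "Miami")

# Group the raw spellings by their lower-cased form.
variants <- split(cities, tolower(cities))

# Keep only the groups that contain more than one spelling.
case_variants <- variants[lengths(variants) > 1]
print(case_variants)
```

Each element of `case_variants` lists the spellings that would collapse into a single value once capitalization is normalized.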
ChatGPT can help detect such cases and generate code that corrects them. In addition, we convert the values to title case, so that each word starts with an uppercase letter followed by lowercase letters:
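# Normalize capitalization so that case-only variants collapse into one value
# (note: str_to_title() also turns e.g. "McLean" into "Mclean").
inc$city <- str_to_title(inc$city)
# Drop everything from the first comma on, e.g. a trailing state ("Pensacola, Fl").
inc$city <- sub(",.*", "", inc$city)
# Fix individual spellings; fixed = TRUE treats each pattern as a literal string,
# so "." is not interpreted as a regex wildcard.
inc$city <- sub(":Livermore", "Livermore", inc$city, fixed = TRUE)
inc$city <- sub("Ahaheim", "Anaheim", inc$city, fixed = TRUE)
inc$city <- sub("Birmimgham", "Birmingham", inc$city, fixed = TRUE)
inc$city <- sub("Covingtom", "Covington", inc$city, fixed = TRUE)
inc$city <- sub("Ft Worth", "Fort Worth", inc$city, fixed = TRUE)
inc$city <- sub("Ft. Lauderdale", "Fort Lauderdale", inc$city, fixed = TRUE)
inc$city <- sub("Encinitass", "Encinitas", inc$city, fixed = TRUE)
inc$city <- sub("Colorad Springs", "Colorado Springs", inc$city, fixed = TRUE)
inc$city <- sub("Chicao", "Chicago", inc$city, fixed = TRUE)
inc$city <- sub("Naperrville", "Naperville", inc$city, fixed = TRUE)
inc$city <- sub("Wilminton", "Wilmington", inc$city, fixed = TRUE)
# "St. Louis Mo" must be handled before the generic "St Louis" rule.
inc$city <- sub("St. Louis Mo", "St. Louis", inc$city, fixed = TRUE)
inc$city <- sub("St Louis", "St. Louis", inc$city, fixed = TRUE)
inc$city <- sub("St Petersburg", "St. Petersburg", inc$city, fixed = TRUE)
inc$city <- sub("San Diago", "San Diego", inc$city, fixed = TRUE)
inc$city <- sub("San José", "San Jose", inc$city, fixed = TRUE)
inc$city <- sub("St Paul", "St. Paul", inc$city, fixed = TRUE)
# Note: this maps Santa Rosa Beach onto Santa Rosa, which merges two distinct places.
inc$city <- sub("Santa Rosa Beach", "Santa Rosa", inc$city, fixed = TRUE)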
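Typos such as the ones corrected above can also be searched for programmatically, as a complement to asking ChatGPT. The sketch below is one possible approach, not the method used in this report: it computes pairwise edit distances with base R's `adist()` on a hypothetical sample vector `cities` and flags pairs that differ by at most two edits. The threshold of 2 and the names `d`, `idx` and `candidates` are assumptions made for illustration.

```r
# Hypothetical sample; in the report this would be unique(inc$city)
# after capitalization has been normalized.
cities <- c("Naperville", "Naperrville", "Wilmington", "Wilminton",
            "Phoenix", "Austin")

# Pairwise Levenshtein distances (utils::adist, available in base R).
d <- adist(cities)

# Keep each unordered pair once and flag pairs that differ by 1 or 2 edits.
idx <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)
candidates <- data.frame(
  a        = cities[idx[, "row"]],
  b        = cities[idx[, "col"]],
  distance = d[idx]
)
print(candidates)
```

The flagged pairs are only candidates: genuinely different cities can have similar names (for example Taylorsville and Taylorville), so each pair still needs a manual check before it is turned into a `sub()` rule.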
Next, we reorder the columns of the output data frame once more and check the structure of the data.
inc <- select(inc, rank, name, industry,
              revenue, three_years_growth_percent, workers,
              founded, yrs_on_list,
              state, metro, city, url)
str(inc)
## 'data.frame': 5012 obs. of 12 variables:
## $ rank : Factor w/ 4999 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
## $ industry : Factor w/ 27 levels "Advertising & Marketing",..: 1 19 11 5 23 13 5 26 13 1 ...
## $ revenue : num 36.9 33.6 24.9 32.4 22.5 ...
## $ three_years_growth_percent: num 36.7 30.5 23.9 21.9 18.2 ...
## $ workers : int 40 39 190 57 25 742 12 72 60 37 ...
## $ founded : num 2015 2015 2015 2014 2014 ...
## $ yrs_on_list : Factor w/ 14 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "AL","AR","AZ",..: 3 43 44 32 38 9 31 46 35 4 ...
## $ metro : chr "Phoenix" "Nashville" "Austin" NA ...
## $ city : chr "Phoenix" "Brentwood" "Austin" "Albuquerque" ...
## $ url : chr "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...
Finally, the corrected data is saved to disk for use in further analyses:
write.csv(inc, file = "inc_corrected.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
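One caveat when loading the saved file later: CSV stores plain text, so columns that were converted to factors are not restored as factors on reload. The sketch below illustrates this with a hypothetical two-row data frame `demo` (not the real data) written to a temporary file; the names `demo`, `path` and `reread` are introduced here for illustration only.

```r
# Hypothetical miniature of the cleaned data frame, for illustration only.
demo <- data.frame(
  rank  = factor(c("1", "2")),
  name  = c("Freestar", "FreightWise"),
  state = factor(c("AZ", "TN")),
  stringsAsFactors = FALSE
)

# Write to a temporary file with the same arguments as in the report.
path <- tempfile(fileext = ".csv")
write.csv(demo, file = path, row.names = FALSE, fileEncoding = "UTF-8")

# After reading back, the former factor columns are integer / character.
reread <- read.csv(path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
str(reread)

# Re-apply the factor conversion after loading.
reread$rank  <- factor(reread$rank)
reread$state <- factor(reread$state)
```

If the column types need to survive unchanged, `saveRDS()` and `readRDS()` are an alternative to CSV, at the cost of a file that is readable only from R.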